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0 Method and apparatus for communicating data between multiple tasks In data processing systems. 



© In a data processing system of the type includ- 
ing a plurality of processing elements interconnected 
with each other and with a plurality of memory 
elements by an interconnection means, a method is 
provided for accommodating the accessing of a se- 
lected memory location in a selected one of the 
memory elements by at least one requesting pro- 
cessing element to read data stored thereat The 
method thereby permits the communication of in- 
formation between the plurality of processing ele- 
ments. The method compnsing the steps of: send- 

aing, from the requesting processing element to the 
selected memory element via the interconnecting 
(O means, a directive including 1) the address of the 
^selected memory location. 2) a comparison value, 
^and 3) an identification of the requesting processing 
element sending the directive; reading in the se- 



lected memory element the data stored at the se- 
lected memory location; comparing In the memory 
O element the read data with the comparison value; 
^notifying the requesting processing element via the 
LU interconnection means if the read data matches the 
comparison value; storing the directive in the se- 
lected memory element if tiie read data does not 



match the comparison value: and repeating the read- 
ing, comparing, and notifying steps each time the 
data in the selected memory location is altered. 
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METHOD AND APPARATUS FOR COMMUNICATING DATA BETWEEN MULTIPLE TASKS IN DATA PRO- 
CESSING SYSTEMS 



The present invention relates generally to data 
processing and more particularly to a method and 
apparatus for communicating data between multiple 
tasks through the use of a memory directive oper- 
ative to monitor the status of a semaphore stored 
at a memory location provided by the requesting 
task. Once initiated, the directive executes repeat- 
edly without the intervention of the requesting task. 
The invention is applicable to both uniprocessor 
and multiprocessor systems executing multiple 
tasks. 

In the past several decades, significant ad- 
vances have been made in the direction of increas- 
ing the density and speed of basic electronic ele- 
ments such as transistors. In accordance with 
these gains, simitar advances have been made in 
the speed and hence computing power of elec- 
tronic microprocessors, such as the Intel model 
80386 microprocessor. In fact, these achievements 
have been so substantial that in many circum- 
stances the speed and density limitations now be- 
ing encountered in semiconductor devices are ba- 
sic physical limitations, such as the speed with 
which electrons can propagate through an electrical 
conductor. Further improvements along these lines 
will thus involve significant advances in the state of 
the art. achievable only at similarly significant ex- 
pense. 

One area of computing which is not. however, 
subject to the physical limitations descnbed above, 
and hence which is ripe for further improvements in 
speed and power, is that of increasing the effi- 
ciency of use of processing systems. 

One type of computing system particularly npe 
for improvements in processing efficiency is that 
known in the art as parallel pTocessing. In a parallel 
processing system, multiple microprocessors of the 
type described above are connected in electronic 
configurations which permit them to perform sepa- 
rate computing tasks, each task divided out of a 
larger application program, or parent. Tasks can 
compnse two types, parent tasks and child tasks, 
the former including control and synchronization 
information for the latter. 

In a true parallel processing system, each of 
the multiple processors has access to shared com- 
mon memory, has access to at least a portion of 
the system inputoulput (I/O.) and is controlled by a 
single operating system providing interaction be- 
tween the processors and the programs they are 
executing. Theoretically, then, it is possible to di- 
vide a large program between N parallel tasks, 
each task running m a separate processor, and 
complete the program a factor of N times faster 



than any single processor could complete the job 
alone. 

Many different system configurations are 
known for connecting the multiple processors, and 

5 related system memory and 10 elements, to func- 
tion in the manner descnbed above. These configu- 
rations include time-share bus configurations 
wherein the system elements are interconnected 
via a time-shared data link, crossbar configurations 

10 wherein the system elements are connected via an 
arrangement of matnx switches, and multiple- 
bus multiport systems wherein processing and 10 
elements are connected to multiple memory ports 
via multiple buses. Each system configuration has 

75 associated with tt different advantages and dis- 
advantages, many of which are still under inves- 
tigation and open to debate between those skilled 
in the art. For a general discussion of multiproces- 
sor performance, the reader is directed to an article 

20 in the IEEE PROCEEDINGS OF THE 1985 INTER- 
NATIONAL CONFERENCE ON PARALLEL PRO- 
CESSING, pgs. 772-781. "A Methodology for Pre- 
dicting Multiprocessor Performance", by A. Norton, 
et al. For a more thorough description of one 

25 particular type of parallel processing system, the 
reader is directed to an article in the IEEE PRO- 
CEEDINGS OF THE 1985 INTERNATIONAL CON- 
FERENCE ON PARALLEL PROCESSING, pages 
764-771. "The IBM Research Parallel Processor 

30 Prototype rRP3): Introduction and Architecture", by 
G.F. Pfister. et al. References to the IBM RP3 
parallel processor will be made throughout this 
document for the purpose of exemplifying features 
typically found in parallel processing systems, tt is 

35 to be understood that the invention set out below is 
in no way limited by the constructs of the RP3 
system. 

One problem area common to each of the 
different types of parallel processing configurations 

40 is that of resource Allocation and management 
That IS, once the program has* been parsed into 
tasks af^enabie to parallel processing, the separate 
tasks must be scheduled, a selected processor 
assigned to each, and the system memory and I/O 

45 resources allocated so as to efficiently utilize the 
parallel processing capabilities, if this resource al- 
location and management is not well handled, 
much of the above-descnbed theoretical efficiency 
of parallel processing is lost. 

50 In prior art parallel processing systems, two 

general methods are provided for utilizing shared 
resources. The first method, processor signaling, 
involves the use of pnmitives initiateo within each 
processor to notify one or more other processors of 
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event occurrences. For the purposes of this docu- 
ment, a primitive is defined as a non-decouplable 
operation, or an operation in the execution of which 
no ofher operation can overlap. Sample primitives 
include, of course, adds and subtracts in the pro- 
cessing element hardware, and fetches and stores 
in the memory hardware. 

Processor signaling requires a substantial 
amount of sophisticated software programming, 
and is perfectly adequate for course grain parallel* 
ism such as the FORKs and JOINs used to divide 
and subsequently re-join large tasks within parallel 
processed programs. As the parallelism becomes 
increasingly fine, however, (i.e. as more and small- 
er tasks in the program are divided out for parallel 
processing), the overhead associated with proces- 
sor signalling becomes unacceptably large, de- 
creasing the efficiency of the parallel processing to 
an unacceptable extent. 

A second method of utilizing shared resources 
is that of using memory semaphores, i.e. indivisa- 
ble modification of memory content to signal the 
availability or unavailability of a particular resource. 
This second method is alternately referred to as 
the use of "lock-outs", "shoulder tapping", or 
"mailboxing'*. each referring to the use of a particu- 
lar message or code placed in a particular memory 
location to notify other processors of the status of a 
resource. In systems .employing such memory 
semaphores, a processor which is waiting on* the 
availability of a particular resource must read the 
memory location containing the code relating to the 
status of that resource, and continue its operation 
accordingly. If the memory semaphore indicates 
the resource Is unavailable, then, in prior art sys- 
tems, the inquiring microprocessor enters a wait 
state wherein processing is halted, this wait state 
being punctuated by periodic re-readings of the 
memory semaphore. This status of repeated waits 
and reads is known in the art as a "spin loop". 

Memory semaphores are perfectly acceptable 
for the coarse grain parallelism descnbed above. 
However, as the parallelism becomes increasingly 
fine, and more tasks are running concurrently, the 
number of these spin loops increases significantly. 
As the number of the spin loops increases, the 
system hardware providing the interconnection be- 
tween the processors and memory, i.e. the bus. 
switching matrix, etc. as described above, encoun- 
ters regions of interferences caused by conflicting 
memory accesses. This problem can result in "hot 
spots", or portions of the processor interconnection 
hardware which become too overloaded with these 
conflicting memory accesses to continue support- 
ing the processing. The system thus experiences 
highly inefficient, unacceptable delays. 

One already known method of diminishing the 
undesirable formation of these hot spots is that of 



combining multiple fetch or read requests for a 
single memory location. According to this method, 
the responsibility for notifying ail of the processors 
waiting on the particular memory location is as- 

5 signed to a single processor. This method, while 
functioning to some extent to relieve hot spots, is 
subject to several disadvantages. First, the effi- 
ciency of such con^bination is dependant on the 
lucky collisions or overlapping of requests for the 

10 same memory location. Such schemes require ad- 
ditional code and storage resources to manipulate 
the lists. The cost in hardware of building the 
interconnect networks required to support such 
combining is very high. Further, if the single pro- 

15 cesser having the notification responsibility should 
fail, continued operation of the system may be 
seriously compromised. For a more thorough dis- 
cussion of hot spots in general, the reader is di- 
rected to an article in the IEEE PROCEEDINGS OF 

20 THE 1986 INTERNATIONAL CONFERENCE ON 
PARALLEL PROCESSING, pgs. 28-34. "The Onset 
of Hot Spot Contention", by M. Kumar, et al. For a 
discussion of hot spots and combining, the reader 
is directed to an article in the IEEE PROCEED- 

25 INGS OF THE 1985 INTERNATIONAL CONFER- 
ENCE ON PARALLEL PROCESSING, pgs. 790- 
797. "'Hot Spot' Contention and Combining in Mul- 
tistage Interconnection Networks", by G.F. Pflster. 
et al. 

30 The following patents are of interest as show- 
ing processing systems having some degree of 
redundancy or automatic error detection to prevent 
processing system failures. They do not address 
the problems of parallel processing systems recog- 

35 nized by the present invention. 

U.S. Pat. No. 4.456.952 to Mohrman et al. 
shows a data processing system having redundant 
control processors for fault detection. Comparators 
are provided lor comparing the operation of the 

40 dual processors. Fault processing circuitry is pro- 
vided for detecting errors between the two proces- 
sors and identifying which of the processors is not 
operating properly. 

U.S. Pat. No. 4.500.959 to Kubo et al. shows a 

45 computing system including a main memory and 
an directive buffer. An inconsistency arises when 
an directive in the main memory is changed after 
that same directive is stored in the directive buffer. 
The system operates to identify such Inconsisten- 

50 cies, and to invalidate the contents of the directive 
buffer when the changed directive is to be ex- 
ecuted. 

U.S. Pat. No. 4.118.789 to Casto et al. shows a 
programmable controller having a control program 
55 with a protected portion. Each time the control 
program is run, the protected portion is compared 
against a corresponding program stored in a ROM. 
If the protected portion of the control program does 
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not correspond vvith the ROM-stored program, an 
error signal is generated and execution is pre- 
vented. 

U.S Pat. No. 3.879.711 to Boaron shows a 
digital data processing system including a data 
processing unit, a central memory unit, and a con- 
trol unit. A sentinel memory is provided for receiv- 
ing a programmed directive. A comparator is pro- 
vioed for comparing the contents of the directives 
register of the central memory unit with the con- 
tents of the sentinel memory. The comparator pro- 
vides a control signal to the control unit when the 
contents are identical. 

While the formation of hot spots as described 
above IS a problem peculiar to parallel processing 
systems, it will be appreciated that systems em- 
ploying single processors, i.e. uniprocessor sys- 
tems, also suffer from problems associated with the 
synchronizing of multiple tasks. In uniprocessor 
systems, large, complex programs are typically 
broken down for execution into smaller, separately 
executable tasks analogous to the child tasks de- 
scnbed above. The operating system is then re- 
sponsible for synchronizing the execution of the 
vanous tasks. Such synchronization might include, 
for example, that necessary to temporarily block a 
task pending the completion of a data 1.0 opera- 
tion, and subsequently awaken the task when the 
operation is complete. 

* Synchronizing multiple tasks in a uniprocessor 
system typically requires the extensive use of a 
"polling " operation, whereby the operating system 
reads semaphores of the type described above to 
check the status of various tasks. The results of 
this polling can then be used to change the status 
of tasks, as appropriate. This polling, however, re- 
quires a substantial quantity of system resources, 
particularly of processing time. As the number of 
tasks requiring synchronization increases, the pol- 
ling increases accordingly. Eventually, a substantial 
quantity of processing time becomes tied up m 
task synchronization, detrimentally affecting the 
system resources available for actual processing. 
For a discussion of task states, the reader ts di- 
rected to "An Introduction to Processing Systems'*, 
by H.M. Deitei. Addison-Wesley Publishing Com- 
pany. Inc.. 1984. pgs. 63-72. For a fu.iher discus- 
sion of synchronizing tasks in uniprocessor and 
multiprocessor environments, the reader is directed 
to "Software Engineering with Ada", by G. Booch. 
Benjamin Cummings Publishing Co.. 1983. pgs. 
231-235. 

The principal object of the present invention is 
to provide a new and improved method and ap- 
paratus for communicating data between multiple 
tasks in data processing systems. 

Another object of the present invention is to 
provide such a method and apparatus through the 



use of semaphores in selected memory locations. 

A further object of the present invention is to 
provide such a method and apparatus for commu- 
nicating data between multiple tasks in a un- 

5 iprocessor processing system. 

Yet another object of the present invention is to 
provide such a method and appartus for commu- 
nicating data between multiple tasks in a multiple 
processing system. 

w A more specific object of the present invention 
IS to provide a method and apparatus for commu- 
nicating data between multiple tasks in a process- 
ing system through the use of a directive operative 
in a memory element to monitor the status of a 

15 semaphore. 

In accordance with a first embodiment of the 
present invention, there is provided for use in a 
data processing system of the type including a 
plurality of processing elements interconnected 

20 with each other and with at least one memory 
element by an interconnection means, a new and 
improved method accommodating the accessing of 
a selected memory location m a selected memory 
element by at least one requesting processing ele- 

25 ment to read data stored thereat. The method 
thereby permits the communication of information 
between the plurality of processing elements. The 
method comprises the steps of: sending, from the 
requesting processing element to the selected 

30 memory element, a directive including 1) the ad- 
dress of the selected memory location, and 2) an 
identification of the requesting processing element 
sending the directive: reading m the selected mem- 
ory element the data stored at the selected mem- 

35 ory location: sending the read data obtained by the 
reading step to the requesting processing element 
via the interconnection means; storing the directive 
in the memory element: and repeating the reading 
and sending steps each time the data in the se- 

^ lected memory location is altered. 

In accordance with another embodiment of the 
present invention, the above directive includes a 
comparison value, and. m lieu of sending the read 
data each time the memory location is altered, the 

•45 method compares the read data with the compari- 
son value and notifies the requesting processing 
element when the data read from the selected 
memory location matches the comparison value. 
In accordance with another embodiment of the 

so present invention, there ts provided for use in a 
data processing system of the type including a 
processing element interconnected with at least 
one memory element by an interconnection means, 
a method accommodating the accessing of at least 

55 one selected memory location by at least one 
requesting task to read data stored thereat. The 
method compnses the steps of: sending from the 
processing element to the memory element a di- 
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rective including 1) the address of the selected 
memory location, and 2) an identification of the 
requesting task; reading in the memory element 
the data stored at the selected memory location: 
sending the read data obtained by the reading step 
to the processing element with the task identifica- 
tion; storing the directive in the memory element: 
and repeating the reading and sending steps when 
the data in the selected memory location is altered. 

In accordance with another embodiment of the 
present invention, the above directive includes a 
comparison value, and, in lieu of sending the read 
data when the memory location is altered, the 
method compares the read data with the compan- 
son value and notifies the processing element 
when the data read from the selected memory 
location matches the comparison value. 

In yet another embodiment of the present in- 
vention, there is provided for use in a data process- 
ing system of the type including at least one pro- 
cessing element interconnected with at least one 
memory element by an interconnection means, a 
method of executing a primitive insti-uction in ac- 
cordance with a directive stored in the memory 
element. The method comprises the steps of: read- 
ing the directive including 1) an address of a se- 
lected memory location, and 2) an identification of 
a requesting processing element: reading tiie data 
at the address: sending the read data to the pro- 
cessing element: and repeating the steps of read- 
ing the directive, reading the data, and sending the 
data when the data in the selected memory loca- 
tion is altered. 

In accordance with a further embodiment of the 
present invention, the above directive includes a 
comparison value, and. in lieu of sending the read 
data each time tiie memory location is altered, the 
method compares tiie read data with the compari- 
son value and notifies tiie requesting processing 
element when the data read from the selected 
memory location matches the comparison value. 

These and otiier objects, features, and advan- 
tages of the present invention wilt become appar- 
ent to those skilled in the art upon a consideration 
of the detailed description of the invention set out 
below when read in conjunction with the drawing 
Figures, in which: 

Rg. 1 is a block diagram showing a parallel 
processing system constructed in accordance with 
tiie present invention: 

Rg. 2 is a schematic diagram of a process- 
ing element from Rg. 1; 

Rg. 3 is a schematic diagram of a memory 
element from Rg. 1 : 

Rg. 4 is a block diagram showing tiie alloca- 
tion of memory locations witiitn tiie memory ele- 
ment of Rg. 3; 

Rgs. 5A, 5B comprise flow charts showing a 



simplified example of parallel processes utilizing 
spin loops in accordance with the prior art: 

Rgs. 6A. 6B and 6C comprise flow charts 
illustrating the execution of a Compare And Notify 
5 directive in accordance with the present invention; 
and 

Rgs, 7A, 7B comprise flow charts showing a 
simplified example of parallel processes running on 
tiie system of Rg. 1 and utilizing Compare And 
10 Notify directives in accordance witii tiie present 
invention. 

The present invention will be described first 
with reference to a multiprocessing system, and 
subsequentiy with reference to a uniprocessor pro- 

15 cessing system. 

Referring now to Rg. 1. a simplified block 
diagram of a multiprocessor system 10 is shown 
including a plurality N of processing elements (PE). 
indicated at 12-1 through 12-N, connected to a 

20 plurality M of memory elements (ME), indicated at 
14-1 ttirough 14M through an interconnection net- 
work 16. Processing elements 12 and memory 
elements 14 are interconnected by network 16 so 
as to permit duplex data communication there- 

25 between. As discussed in further detail below, sys- 
tem 10 is structured to accommodate parallel pro- 
cessing of tasks amongst processing elements 12, 
tiie processing elements having access to shared 
data distributed between memory elements 14, and 

30 the various tasks being, under the control of a 
single operating system (not shown). A variety of 
operating systems supporting such parallel pro- 
cessing are known to those skilled in the art. and 
the details of any one particular operating system 

35 are not necessary to an understanding of tiie . 
present invention. 

Referring now to Fig. 2. the features of one 
exemplary processing element 12-N are shown in 
greater detail. Processing element 12-N includes a 

40 computing engine 18. preferably including an 
arithmetic logic unrt (ALU) and a floating point 
processor (FPP). conventional features well known 
to those skilled in the art. A memory mapping 
controller 20 is connected to computing engine 18 

45 for mapping tiie memory addresses generated by 
tiie computing engine to memory elements 14 (Rg. 
1 ) including a local memory space 46A designated 
for the exclusive use of each processing element 
12 (but not necessarily located with the processing 

50 element hardware). An interface 22 is connected to 
memory mapping controller 20 for providing a logi- 
cal interface with interconnection network 18. A 
controller 24 is associated with each processing 
element 12 for controlling the operation of Uie 

55 various components therein. It will be understood 
that, while the interconnections between the various 
elements may be Indicated with single lines, the 
drawings are intended to show logical connections. 
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and the actual signa) lines nnay comprise buses of 
nnulttpie conductors supporting multiple, parallel 
signals. 

In the operation of processing element 1 2. con- 
trol logic 24 functions to control computing engine 
18 in accordance with directives stored in«memory 
including local memory 4eA. Memory mapping 
controller 20 functions to convert memory address- 
es generated by computing engine 18 to accom- 
modate various memory paging and interleaving 
schemes as discussed in further detail below. Inter- 
face 22 provides a logical interface for exchanging 
digital data between memory mapping controller 20 
and interconnection network 16. 

The processing element 12 shown in Rg. 2 is 
intended to represent, for the purposes of describ- 
ing the present invention, a generic type of pro- 
cessing element. Many specific types of process- 
ing elements, for example the Intel model 80386. 
are well known to those skilled in the art. For a 
descnption of the processing element utilized in 
the IBM RP3 parallel processing system described 
in the documents referenced hereinabove, the 
reader is referred to an article in the IEEE PRO- 
CEEDINGS OF THE 1985 INTERNATIONAL CON- 
FERENCE ON PARALLEL PROCESSING, pgs. 
782-789. "RPS Processor-Memory Element", by 
W. C. Brantley et al. 

Referring now to Rg. 3. a memory element 14 
IS shown constructed in accordance with the 
present invention. Memory element 14 includes 
four digital data comparators indicated at 30. 32. 
34. 36. three time domain multiplexers (mux) in- 
dicated at 38. 40. 42. a digital adder 44. a main 
memory 46 comprised of dynamic random access 
memory (DRAM) chips, and an associative memory 
system including an associative address controller 
48 and an associative memory 50 also compnsed 
of DRAMS. It will be appreciated by those skilled in 
the art that ail of the above-descnbed components 
are conventional in the digital signal processing 
field. Information on the construction and operation 
of associative memories can be found, if desired, m 
the book: "Computer Architecture and Parallel Pro- 
cessing", by Hwang and Bnggs. McGraw-Hill. Inc.. 
1984. pgs. 375-380. Further included with memory 
element t4 is control logic 52 for controlling the 
operation of the memory element in a manner 
described in detail hereinbelow. 

Continuing to describe the construction of 
memory element 14. main memory 46 has its 
address mput connected to the output of mux 42. 
Memory 14 has its data input connected to a 
OATA^WRITE signal line in parallel with an input 
to mux 38. an input to comparator 34. a data input 
D2 to associative memory 50 and an input to 
comparator 32. The DATA_WRITE signal includes 
the data, for example an 8-bit byte, to be written 



into memory 14. Mam memory 46 has its data 
output connected in parallel to an input of mux 38 
and an mput of comparator 32. Mux 42 has its first 
input connected to an ADDRESS signal line in 

5 parallel with an input to comparator 30. a data input 
D1 to memory 50. and an input to associative 
address controller 48. The ADDRESS signal in- 
cludes the address for the memory location to be 
accessed in memory 46. MUX 42 has its second 

fo input connected to an output 01 of memory 50 in 
parallel with an input to comparator 30. 

Associative memory 50 has its address input 
connected to an output of associative address con- 
troller 48. and a data input 03 connected to the 

;5 output of adder 44. A data input 04 of associative 
memory 50 is connected to a requesting processor 
Identification PID_REQ signal line in parallel with 
an input to mux 40, the second input to the mux 
being connected to associative memory output 04. 

20 The PI0_REQ signal includes the identification of 
a requesting processing element 12 and. where 
appropnate. identification of a particular task per- 
formed by that processing element. Output 03 of 
associative memory 50 is connected to an input to 

2S comparator 36. and output 02 of the memory is 
connected to an input of comparator 34. Compara- 
tor 36 includes a second input connected to a 
MASTER_TIME_CLOCK signal in parallel with an 
input to adder 44. the second input to the adder 

30 being connected to a TIMEOUT signal line. The 
MASTER_T1ME_CL0CK signal is a real time 
counter distnbuted throughout system 10 by a 
clock bus (not shown). The TIMEOUT signal, pro- 
vided by the tasking processing element 12. pro- 
as, vides the time during which memory element 14 is 
to repeatedly execute the compare and notify 
(CAN) directive described below. 

Control logic 52 receives three control signals 
generated externally to memory element 14 by 

40 processing elements 12: a VALID^IN signal in- 
dicating that incoming address and data signals are 
valid, a REQUEST_TYPE signal indicating the 
type of memory access desired (i.e. READ. LOAD, 
etc.), and an ACCEPTED signal indicating that data 

45 transmitted from memory element 14 to a process- 
ing element has been accepted. Including the AD- 
DRESS. DATAWRITE. TIMEOUT, and PID_REQ 
signals, memory element 1 4 receives seven signals 
from processing element t2. 

50 Control logic 52 generates two control signals 
for transmission to processing elements 1 2 external 
to memory element 14: a clear to send CTS signal, 
indicating that the memory is free to accept a 
request, and a VALID^OUT signal indicating that 

55 valid data is being transmitted at the outputs of 
muxs 38 and 40. Including the OATAREAD signal 
at the output of mux 38, and the PID__RESP signal 
at the output of mux 40. memory element 14 thus 
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generates four signals for transmission to process- 
ing element 12. Control 52 further receives a sys- 
tem clock pulse SYSTEM CLOCK. 

Interna) to memory element 14. control logic 52 
is connected to receive five signals: the output 
signals from comparators 30. 32. 34. and 36. and 
data (described below) from associative address 
controller 48. Control logic 52 is connected to 
transmit control signals to the following units within 
memory element 14: mux 40, memory 50, associ- 
ative address controller 48. main memory 46. and 
mux 38. 

Referring now to Rg. 4, a simple schematic 
diagram is used to illustrate a preferred method of 
relegating the memory space within a main mem- 
ory 46 . For the purposes of Fig. 4. it will be 
understood that mam memory 46' includes the 
entirety of the main memory distributed across 
memory elements 14, i.e. the total memory space 
provided by the sum of main memories 46 in each 
of the M memory elements 14. The lower ad- 
dresses of memory 46 . indicated at 46B, are gen- 
erally designated for the operating system kernel. 
A high address memory space, indicated generally 
at 46C. is designated as shared memory space 
accessible by any of processors 12. and may con- 
tain, for example, semaphores (or flags) 54. 56, and 
shared data such as the count shown at 58. Be- 
tween higher address shared memory space 46C 
and lower address memory space 468 occupied 
by the operating system, a memory space 46A 
(see also Fig. 2) is designated as unshared (or 
private) memory associated with a specific pro- 
cessing element 12. It will be understood that these 
regions 46A. 468. 46C of the memory are not fixed 
in size, and needn't be contiguous in nature. They 
represent one preferred method of allocating the 
memory resources within the parallel processing 
system of the present invention. 

While the memory elements 14 have been 
shown as M discrete elements each supporting 
DRAM memory 46, it will be understood that in an 
actual implementation the memory can comprise 
one or more elements, and is typically paged and 
interleaved in accordance with many such schemes 
well known to those skilled in the art. Memory 
mapping controller 20 of processing element 12 
functions to convert the address generated by 
computing engine 18 in accordance with the se- 
lected paging and/or interleaving scheme. For a 
more thorough discussion of memory addressing in 
parallel processing systems, the reader is directed 
to "Computer Architectu.^'e and Parallel . Process- 
ing", by Hwang and Briggs. McGraw-Hill. Inc. 1984. 
pgs. 60-118. A detailed description of the operation 
of memory elements 14 in parallel processing sys- 
tem 10 is set out hereinbelow with reference to 
Rgs. 6-8. 



Many of the elements shown in Rg. 3 are 
descnbed below as located "in" the memory ele- 
ments, and many of the functions performed by 
these elements are descnbed as being performed 

5 "in" the memory elements. Referring back to Rg. 
1. the descnption of an element and/or function as 
"in" a memory element is defined to mean the 
elements and functions so descnbed are located 
functionally outside of the interconnection network 

10 ports on the side of the memory elements. The use 
of the descriptor "in" is not intended to imply any 
necessary physical relationship such as board, 
chip, or cabinet location. 

Referring again back to Rg. 1. the type of 

}5 interconnection network 16. utilized in system 10 is 
not relevant to the understanding or practice of the 
present invention. Refernng to the articles on "hot 
spots" referenced above, the reader will see that 
the problem addressed by the present invention 

20 arises in parallel processing systems regardless of 
the type of interconnection network used. Whether 
interconnection network 16 is selected to be a bus 
network, a matrix switching network, a multi- 
bus multiport network, or some combination of the 

25 above, it must include a tree structure for intercon- 
necting multipIeT sinks (i.e. processors, I/O devices, 
etc.) to every shared memory location. Hence, re- 
gardless of the type of interconnection network 
used, the present invention can be utilized to mini- 

30 mize the occurrence of hot spots resulting from 
memory access contentions. As an example of one 
particular type of interconnection network, the read- 
er is directed to an article describing that used in 
the IBM RP3 parallel processor discussed above: 

35 IEEE PROCEEDINGS OF THE 1985 INTERNA- 
TIONAL CONFERENCE ON PARALLEL PRO- 
CESSING, pages 764-771, "The IBM Research 
Parallel Processor Prototype (RP3) : Introduction 
and Architecture", by Q.F. Pflster. et al. 

40 Referring now to Fig. 5. including Rgs. 5A. 5B. 
a simple FORKJOIN process is shown using spin 
loops in accordance with the prior art. It is to be 
understood that the present invention, and m par- 
ticular the operation of memory element 14 of Rg. 

45 3. substantially solves the hot spot problems by 
eliminating the spin loops shown in Rg. 5, The 
prior art operation is descnbed at this point in the 
document only so that it may be clearly contrasted 
to the Improved operation in accordance with the 

50 operation of the invention described immediately 
thereafter. 

Figure 5 shows a flow chart of a process in- 
cluding a parent task indicated at 5A. and N child 
tasks indicated at 58. these parent and child tasks 
55 having been parsed by an operating system for 
parallel processing in multiple processing ele- 
ments. The parent task 5A initializes the data re- 
quired by the N child tasks' 58 for parallel process- 
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>ng. and then signals the child tasks with a count of 
"-I"* (a sennaphore) stored in a shared memory 
location indicated as FLAGl. The parent task then 
enters a spin loop, indicated in dashed-line at 60. 
wherein it repeatedly fetches the count stored in 
FLAGl to determine if ail of the children have 
completed thetr processing, i.e. if the stored count 
equals the number N of child tasks. The parent 
task will remain in this spin loop, repeatedly fetch- 
ing the count stored at the FLAGl memory location 
until the value read indicates the child tasks are ail 
completed. After the fetched value indicates that 
the child tasks are completed, the parent task will 
collect the results of the child tasks and store a "* 
1 " count in a separate FLAG2 memory location to 
indicate to the child tasks that they are all com- 
plete, and the data is collected. 

For their part, the N child tasks will initiate spin 
loops, indicated at 62. repeatedly fetching the con- 
tents at the FLAGl memory location until they 
receive the count indicating that they can 
begin processing (i.e. that the necessary data is 
initialized) Each child will perform its task, or 
"chunk" of work, incrementing the count in FLAGl 
upon its completion. Thereafter, each child will 
enter a second spin loop, indicated at 64. repeat- 
edly fetching the count at FLAG2 until it indicates 
the process is complete and the data collected. 
Three spin loops, indicated at 60. 62. 64. are thus 
established from a simple FORK JOIN operation of 
parallel processed parent child tasks. 

Continuing to describe the prior art of Fig. 5. 
each time a loop is executed m each of spin loops 
60. 62. 64 a memory location FETCH is initiated at 
the requesting processing element and propagated 
through an interconnection network to the memory 
location. The FETCH is performed as a primitive in 
the particular memory element storing the flag ad- 
dress to be read, and the value read from the 
memory location is transmitted back across the 
interconnection network to the requesting process- 
ing element. As increasing numbers of tasks are 
performed m parallel, increasing memory 
semaphonng is required to coordinate the tasks. As 
the memory semaphoring increases, the number of 
spin loops increases. And, as described above and 
in the referenced articles, as the number of spin 
loops increases, the greater the contention and 
hence interference will be for accessing memory 
locations. In the operation shown in Fig. 5, for 
example, it is seen that at some point all of the N 
childrerr will be sn contention for reading the values 
stored at FLAGl and FLAG2. this contention cre- 
ates the above-described hot spots, greatly inhibit- 
ing the ability of the interconnection network to 
transmit data, and thus greatly diminishing the pro- 
cessing efficiency of a multiprocessor system. It is 
important to understand that the creation of these 



hot spots causes memory access contention not 
only at the memory locations being accessed in 
the spin loops, but. due to the tree structure inher- 
ent in all of the mterconnection networks, interfer- 
5 ence exists when accessing any of the memory 
locations. Spin loops could thus be said to work in 
a synergistic manner to degrade system perfor- 
mance. 

The operation of the invention will now be 

10 descnbed with reference to Figs. 3. 6. the former 
showing the hardware implementation as described 
above and the tatter illustrating the process flow in 
the manner descnbed below. Figure 6 has been 
sutxiivided into three separate parts. Rg. 6A in- 

15 dicating functions performed in the requesting pro- 
cessing element 12 to generate the CAN directive, 
Fig. 6B indicating functions performed in the mem- 
ory element 14 to execute the directive as a pnmi- 
tive. and Fig. 6C indicating functions performed in 

20 the requesting processing element 12 during and 
after the execution of the CAN directive. As an 
overview, the operation generally consists of a 
Compare and Noofy (CAN) directive assembled 
and initiated by a requesting or tasking one of the 

25 processing elements 12. and executed as a pnmi- 
tive in a memory element 14 (of the type shown in 
Fig. 3) containing a designated address. The op- 
eration will be descnbed with respect to two dif- 
ferent scenarios: 1) an immediately successful 

30 CAN. and 2) a stored and re-executed CAN. 



SCENARIO 1. AN IMMEDIATELY SUCCESSFUL 
CAN 

35 

Refernng now to Figs. 3. and 6A. to initiate a 
CAN directive, a requesting processing element 12 
first assembles the directive, including four distinct 
data fields. The first field, a processing element 

40 and task identifier (PID_REO). includes data iden- 
tifying which of the processing elements 12 is 
requesting the CAN. If necessary to distinguish 
among multiple tasks performed in the requesting 
processing element 12, the first data field also 

45 includes a task identifier. The second data field is 
the address (ADDRESS) of the memory flag to be 
read. l e. the selected menrsi^y location m the se- 
lected memory element 14. The third data field is 
the companson value that the requesting micropro- 

50 cesser expects to find at the address of the mem- 
ory' flag. This third data field will arnve on the 
DATA_VvRITE signal line of Fig. 3 when the direc- 
tive «s transmitted. The fourth data field is a timeout 
value (TIMEOUT) defining the duration of time over 

55 which the CAN directive is to be repeated in the 
memory element if it is not immediately successful. 

Once the CAN directive is assembled it is 
transmitted to memory element 14 via an appro- 



15 



EP0 361 176 A2 



16 



priate electronic "handshake" with control unit 52. 
To provide this electronic handshake, the request- 
ing processing element 12 checks the clear to 
send CIS signal line to be sure the selected mem- 
ory element 14 is free to receive a directive, and 
then sends an appropriate REQUEST TYPE signal 
identifying a CAN directive along with a VALID IN 
signal and the assembled CAN directive. 

Referring now also to Fig. 6B. on the initial 
receipt of the CAN directive, the ADDRESS signal 
is gated through mux 42 and utilized by control 
logic 52 to read the selected location in main 
memory 46. The processor task identification 
PID_REQ is gated to the output of mux 40 
(PID__RESP) and used to transmit this first data 
read from memory 14 back to the requesting pro- 
cessing element on the DATA__READ signal line at 
the output of mux 38. Appropriate electronic hand- 
shaking is performed through the generating of a 
VALID OUT signal by the memory element, and a 
return of the ACCEPTED signal from the request- 
ing processing element. 

Referring now also to Fig. 6C. in the illustrated 
embodiment of the invention, a DATA_READ val- 
ue, comprising the actual contents of the selected 
location in memory element 14. is always returned 
to the requesting microprocessing element 12. 
which must make its own comparison between the 
expected (I.e. comparison) value and the returned 
value to determine if a match has occurred. Ac- 
cordingly, the requesting processing element com- 
pares the read value with the expected value (FIG. 
6C) and determines if they match and processing 
should continue. If no match is found, processing 
can wait or be switched to an alternate task. It will 
be appreciated that as an alternative to returning 
the contents of the memory location to the request- 
ing processing element 12. the comparison with 
the comparison value can be done in memory 
element 14, and a compare/no compare signal 
returned to the requesting processing element 12. 

Generally concurrent with the transmission of 
the DATA_READ signal back to the requesting 
processing element, the DATA^WRITE 
(comparison value) and DATA^READ (read value) 
signals are compared in comparator 32. Assuming 
for the sake of this scenario that they match, a 
successful match is noted by control logic 52 and 
the CAN directive is terminated. This termination 
includes discarding the CAN directive so that it is 
not repeatedly executed in the manner described 
below. Note that the associative memory system 
including associative address controller 48 and as- 
sociative memory 50 goes unused. 

It will be appreciated that the above-described 
scenario is roughly equivalent to the spin loop 
operation shown in Fig. 5 wherein the expected 
semaphore data is found on the first read of the 



memory location. Thus, the substantial advantages 
provided by the present invention are not imme- 
diately apparent from this scenario. 

5 

SCENARIO 2: A STORED AND RE-EXECUTED 
CAN 

This scenario is initiated in a manner identical 

10 to scenano 1 above. That is. the functions shown in 
Rg. 6A are performed identically, as are the 
-READ MEMORY LOCATION" and "TRANSMIT 
FIRST READ DATA TO PE" functions of Rg. 6B. 
However, in the present scenario the DATA__^READ 

IS does not compare with the DATA ^WRITE. and the 

"COMPARE WITH COMPARISON VALUE" test of 
Rg. 68 branches into the "NOT EQUAL" branch. 
Both the requesting processing element 12 and 
control logic 52 will recognize that there is no 

20 match, the former from doing a compare on the 
returned DATA^READ signal, the latter from the 
signal generated at the output of comparator 32. 
The requesting processing element 12 may then 
chose to wait, or to switch to an alternate task (FIG. 

25 6C). In either case, the requesting processing ele- 
ment 12 does not enter a spin loop as did the 
processing elements in the prior art described with 
respect to Fig. 5. 

Sensing that there was no match, control logic 

30 52 directs associative address controller 48 to store 
the processing element task identification 
(PID^^REQ). the selected memory location address 
(ADDRESS), the comparison value 
(DATA_WRITE). and the the sum of TIMEOUT 

3S plus MASTER TIME CLOCK (i.e. the EXPIRATION 
TIME signal at the output of adder 44) in an ad- 
dress of associative memory 50 via data inputs D4. 
D1. D2, D3, respectively. In a manner conventional 
to the operation of associative memory systems, 

40 the address m memory 50 selected for storing the 
CAN directive is associative with the selected 
memory location address (ADDRESS) of main 
memory 46. This operation of an associative mem- 
ory system will be well understood by those skilled 

45 in the art. and permits associative memory 50 to 
be substantially smaller than mam memory 46. The 
EXPIRATION TIME signal, generated by adder 44. 
comprises a subsequent time at which the CAN 
directive will be automatically terminated. 

so In accordance with the present invention, each 
time the contents of main memory 46 are altered, 
for example through a LOAD operation, the con- 
tents of associative memory 50 are examined, and 
if a CAN directive is found having a stored address 

55 value matching that of the altered memory location, 
the CAN directive is automatically executed as a 
primitive operation. Assuming then that a memory 
location in main memory 46 has been altered. 
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control logic 52 directs the reading of associative 
memory 50 and the companson of all addresses 
stored >n the CAN directives with the current AD- 
DRESS signal indicating the altered memory con- 
lent location This companson is performed in com- 
parator 30. the comparator receiving the CAN di- 
rective addresses from output 01 of associative 
memory 50 and the current ADDRESS signal di- 
rectly off of the signal line. 

If a CAN directive is found containing a mem- 
ory location address matching that of the current 
ADDRESS signal, the comparison value stored with 
the CAN directive is read out of associative mem- 
ory 50 on output 02 and compared to the contents 
of the altered memory location in main memory 46 
(i.e. the value of the OATA_WRITE signal). This 
companson is performed in comparator 34. Assum- 
ing first that the values don't compare, then no 
action IS taken with respect to the requesting pro- 
cessing element. It will be appreciated that these 
compares are automatically executed each time a 
location in mam memory 46 is altered, without 
requiring any action or memory access from the 
requesting processing elements 12 which originally 
generated the CAN directives. 

Assume now that the last descnbed compare 
in comparator 34 yields a match, i.e. that the 
companson value stored with the CAN directive in 
associative memory 50 matches the value of the 
DATA_WRITE Signal. In this circumstance, the 
PID_REQ signal (which has been read, concur- 
rently with the companson value of the CAN direc- 
tive, on output 04 of the associative memory) is 
gated to the output of mux 40. The PID^REO 
signal is used to return the current value of the 
updated location in mam memory 46 to the re- 
questing processing element 12. The value is re- 
turned, of course, using the appropriate electronic 
handshaking as described above. It wilt be appre- 
ciated that this returned value, gated as the 
DATA_READ signal to the output of mux 36. can 
be provided either from the DATA_WRITE signal 
or the actual data read at the output 0 of mam 
memory 46, both being equal. The requesting pro- 
cessing element 12. upon receiving the returned 
value, either releases from a wait state or pro- 
cesses the returned data via a conventional inter- 
rupt mechanism. Subsequently, when the request- 
ing task IS reactivated, it does it's own compare to 
determine that the companson value has been 
matched and hence the condition has been met. 
.Again, it is !mpo.''ta.nt to note that all of the above 
executions of the CAN directive stored in associ- 
ative memory 50 have occurred without any mem- 
ory accesses or other action on the part of the 
requesting processing element 1 2. 

Describing now the utilization of the TIMEOUT 
data provided with the CAN directive, this data is 



first used to provide the EXPIRATION TIME value 
in the manner described above. This EXPIRATION 
TIME value is stored with the CAN directive in 
associative memory 50. Subsequently, during cy- 
5 cles of system 10 wherein the contents of main 
memory 46 are not altered, control logic 52 func- 
tions to read the stored CAN directives from asso- 
ciative memory 50 and compare the stored EX- 
PIRATION TIME data with the MASTER TIME 

w CLOCK data at comparator 36. These operations 
are shown in Fig. 6B at the "NO" branch of the 
"MAIN MEMORY LOCATION UPDATE" test When 
the EXPIRATION TIME for a given CAN has been 
exceeded, the address stored with that CAN is 

/5 gated through mux 42 to read the current value of 
the selected mam memory location. This current 
value is gated as the DATA_READ signal to the 
output of mux 38. Generally concurrently, the iden- 
tification of the requesting processing element 12, 

20 P1D_REQ. IS gated from associative memory 50 to 
the output of mux 40. The PID_REQ signal is then 
used to return the DATA_REAO signal to the re- 
questing processing element 12. using, of course, 
the appropriate electronic handshaking descnbed 

25 above. The requesting processing element 12 then 
does Its own compare between the value returned 
from memory element 14 and its own comparison 
value, and determines that there is no match. The 
requesting processing element 12 can then enter 

30 an error recovery state (not shown). The CAN di- 
rective is subsequently terminated by erasing it 
from associative memory 50 so that it is not subse- 
quently read or executed. It will be appreciated that 
the above-described timeout of the CAN directive 

35 was executed with no memory access or other 
action on the part of the requesting processing 
element 12. 

Several features of the present invention will 
become apparent from a consideration of the 

40 above. First, while the operation of parallel process- 
ing system lO has been generally descnbed with 
respect to the generation and execution of a single 
CAN directive, it will be appreciated that many 
CAN directives can be stored m associative mem- 

45 ory 50 and executed by memory element 14. 
These CAN directives can ongmate from the same 
or different processing elements 12. and more than 
one CAN directive can be directed to the same 
selected memory location m main memory 46. 

50 Such would be the case, for example, when mul- 
tiple child tasks being performed m different pro- 
cessing elements 12 are all waiting on a particular 
semaphore to arrive at the same memory location. 
For memory element 14 to function in the manner 

55 descnbed above, it may be desirable for the mem- 
ory element to operate at an internal clock rate 
faster than that of the mam system clock for par- 
allel processing system 10. Such a speed increase 
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would, of course, be determined by the size of 
associative memory 50 and the number of CAN 
directives storable there. 

Refemng now to Rgs. 7A. 7B a simple 
FORKi JOIN operation is shown including a parent 
task at 7A and a plurality N of child tasks at 78. 
This FORK. JOIN operation is similar to that shown 
and described with respect to Fig. 5 above, except* 
ing that the process is shown now using the CAN 
directive of the subject Invention. 

Referring first to Fig. 7A. the processing ele- 
ment 12 running the parent task initializes data 
required to run the child tasks, and then stores a 
count of "-r at the shared memory location 
FLAG1. The processing element 12 then generates 
a CAN directive to detennine when ail of the child 
tasks are complete. This CAN directive includes a 
comparison value equal to the number of child 
tasks, the identification of the requesting process- 
ing element 12 (and that of the task if appropriate), 
the address of FLAG1 to be monitored, and a 
TIMEOUT value determined by the particular pro- 
cess being run. If the CAN directive is not imme- 
diately successful, i.e. the compare value and the 
count in the FLAG1 address do not match, then the 
requesting processing element 12 can enter a wait 
state or switch to an alternate task. If the request- 
ing processing element 12 is subsequently inter- 
rupted with returned data indicating a match, it can 
collect the results generated by the child tasks and 
enter a count "-r in FLAG2 indicating the same to 
the child tasks. If the requesting processing ele- 
ment is subsequently interrupted with returned data 
indicating that a timeout has occurred, \X can enter 
a failure recovery routine (not shown). It is impor- 
tant to note that, regardless of the result brought 
about by the CAN directive, in accordance with a 
main advantage of the present invention, no spin 
loops are required. 

Examining now the child tasks shown in Rg. 
7B. as each task is first initiated, its processing 
element 12 generates a CAN directive to determine 
if the necessary data has been initialized. This 
CAN directive includes a compare value equal to 
r (the semaphore generated by the parent task 
when the data is initialized), the requesting pro- 
cessing element 12 identification\(and task iden- 
tification as required), the address of FLAGI . and a 
TIMEOUT value determined by the process. If the 
data returned as a result of the first execution of 
the CAN directive indicates there is no match be- 
tween the comparison value and the data stored at 
FLAGI. the requesting processing element 12 can 
enter a wait state or switch to an alternate task. If 
data is subsequently returned (as a result of a 
subsequent execution of the CAN directive in se- 
lected memory element 14) indicating a timeout, 
the requesting processing element 12 can enter an 



error recovery mode (not shown). If data is subse- 
quently returned (again as a result of a subsequent 
execution of the CAN directive in selected memory 
element 14) indicating a match between the com- 
5 parison value and the data stored at FLJVGI. the 
requesting processing element 12 can continue to 
perform its "chunk" of work. Incrementing the 
counti at FLAG1 when it is finished. 

Subsequent to the completion of its program, 

10 the processing element 12 running each child task 
generates a second CAN directive to determine if 
the data generated by that child task has been 
cibllected by the parent task. This CAN directive 
includes a companson value of a requesting 

15 processing element identification (and a task iden- 
tification where appropriate), the address of FLAG2. 
and a TIMEOUT value determined by the process. 
If data returned to the requesting processing ele- 
ment 12 as a function of the first execution of the 

20 CAN directive indicates that there is no match 
between the comparison value and the value stored 
at FLAGI, the requesting processing element 12 
can enter a wait state or switch to an alternate task. 
If data is subsequently returned (as a function of a 

25 subsequent execution of the CAN directive in se- 
lected memory element 14) indicating that a 
timeout has occurred, the requesting processing 
element 12 can initiate an error recovery process 
(not shown). If data is subsequently returned (as a 

30 function of a subsequent execution of the CAN 
directive in selected processing element 14) in- 
dicating a match between the comparison value 
and the value stored at the FLAG2 memory loca- 
tion, the requesting processing element 12 can 

35 continue on to the next task. 

It will thus be appreciated that, in contrast to 
the FORK JOIN process described with respect to 
Rg. 5 above, when parallel processing system 10 
is operated in accordance with the present inven- 

40 tion. at no time are the parent or child tasks illus- 
trated in Rg. 7 required to enter a spin loop. The 
present invention does, in fact, obviate the vast 
majority of the spin loops- used in the prior art. 
Thus, in the parallel processing system 10 con- 

45 structed and operated m accordance with the 
present invention, a substantially decreased num- 
ber of memory location accesses are required. This 
decrease in the required number of memory loca- 
tion accesses functions to directly decrease the 

50 usage of interconnection network 16. The de- 
creased usage of interconnection network 16 re- 
duces the number of contentions or interferences 
on the network, substantially eliminating the forma- 
tion of "hot spots". 

55 In accordance with another embodiment of the 
present invention, under some operating conditions 
it is desirable for a processing element 12 to moni- 
tor the contents of a selected memory location. 
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regardless of whether those contents match a com- 
parison value. In such circumstances, the present 
invention is utilized to have the processing element 
12 send a directive to the selected memory ele* 
ment. the directive including the memory location 
and an identification of the requesting processing 
element (and task, if appropriate). The selected 
memory element 14 then operates on the directive 
to return read data whenever a selected memory 
location is altered, regardless of whether such read 
data matches a comparison value. It will be appre- 
ciated that the construction and operation of this 
embodiment of the invention is substantially iden- 
tical to that descnbed above, excepting for the 
omission of the comparison data, the compare 
step, and of course, the hardware in which the 
compare is performed. Such operation would, of 
course, negate the requirement for a spin loop 
initiated by the requesting processing element 12 
to periodically read the memory location, thereby 
providing the above-described advantages. 

In accordance with yet another embodiment of 
the present invention, the CAN directives are bene- 
ficially utilized in a uniprocessor processing sys- 
tem. Such a uniprocessor system is substantially 
identical to that described above with respect to 
Figs. 1. 2. 3, and 4. above, with the exception that 
only a single processing element 12 and memory 
element 14 are provided, the pair being connected 
by an appropriate bus-type interconnection network 
16. The operation of such a uniprocessor system is 
likewise substantially identical to that shown and 
descnbed with respect to Figs. 6 and 7 above, 
excepting that each CAN directive must include a 
task identification, the vanous tasks being executed 
senally on the single processing element 12 (vs. in 
parallel on multiple processing elements as shown 
in Figs. 7A. 7B). 

When utilized in a uniprocessor system, the 
single processing element 12 is relieved of the 
polling chore necessary to monitor semaphores so 
as to synchronize multiple sequential tasks m the 
manner descnbed above. Instead, in accordance 
With the present invention, the status of the sema- 
phores tn memory element 1 4 are monitored using 
the CAN directive. Processing element 12 is thus 
freed to perform data processing. 

There is thus provided a new and improved 
method and apparatus for communicating data be- 
tween multiple tasks in processing systems 
through the use of memory directives operative to 
monitor the status of a semaphore at a memory 
location provided by a requesting processor. When 
utilized in a multiprocessor parallel processing sys- 
tem, the present invention provides for the commu- 
nication of data between a plurality of processing 
elements, through semaphores stored m a plurality 
of memory elements, without repetitious accessing 



of those memory elements over an interconnection 
network. The present invention thus substantially 
reduces hot spots otherwise formed within the in- 
terconnection network, increasing the speed and 

5 efficierKy of the parallel processing system. Rne 
grain parallelism is thus accommodated in large 
scale parallel processing systems. When utilized in 
a uniprocessor environment, the present invention 
provides for the communication of data between 

ro multiple tasks without high overhead polling by the 
processing element. The processing element is 
thus freed to more efficiently process data. 

rs Claims 

1 In a data processing system of the type 
including a processing element interconnected with 
at least one memory element by an interconnection 

20 means, a method accommodating the accessing of 
at least one selected memory location by at least 
one requesting task to read data stored thereat, 
said method being charactenzed by the steps of: 
sending from said processing element to said 

25 memory element a directive including 

1) the address of said selected memory location, 
and 

2) an identification of the requesting task; 
reading in said memory element the data stored at 

30 said selected memory location: 

sending the read data obtained by said reading 
step to said processing element with said task 
identification: 

storing said directive in said memory element; and 
35 repeating said reading and sending steps when the 
data in said selected memory location is altered. 

2. in a data processing system of claim 1. 
characterized by including a plurality of processing 
elements interconnected with each other and with 
40 at least one memory element by an interconnection 
means, a method accommodating the accessing of 
a selected memory location in a selected memory 
element by at least one requesting processing ele- 
ment to read data stored thereat, said method 
45 compnsing the steps of: 

sending, from said requesting processing element 
to said selected memory element via said intercon- 
necting means, a directive including 

1) the address of said selected memory location. 
50 and 

2) an identification of the requesting processing 
element sending said directive: 

reading in said selected memory element the data 
stored at said selected memory location; 
55 sending the read data to said requesting process- 
ing element via said interconnection means: 
stonng said directive m said memory element: and 
repeating said reading and sending steps when the 
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data in said selected memory location is altered. 

3. In a data processing system of claim 1 . said 
method further comprising the steps of: 

sending, from said processing element to said 
memory element, a directive including 
a companson value, and 

comparing in said memory element the read data 
with said companson value: 
notifying said processing element if the read data 
matches said comparison value: 
storing said directive in said selected memory ele- 
ment if the read data does not match said compari- 
son value: and 

repeating said reading, comparing, and notifying 
steps each time the data in said selected memory 
location is altered. 

4. In a data processing system of claim 2. said 
method comprising the steps of: 

sending, from said requesting processing element 
to said selected memory element via said intercon- 
necting means, a directive including 
a companson value, and 

comparing in said memory element the read data 
with said companson value; 
notifying said requesting processing element via 
said interconnection means If the read data match- 
es said comparison value: 

storing said directive in said selected memory ele- 
ment if the read data does not match said compari- 
son value: and 

repeating said reading, comparing, and notifying 
steps when the data in said selected memory loca- 
tion is altered. 

5. A method in accordance with claim 4 and 
further including the step of monitonng, in said 
selected memory element, the altering of the data 
stored in said selected memory location. 

6. A method in accordance with claim 4 and 
further including the step of returning the read data 
obtained by the first one of said reading steps to 
said requesting processing element regardless of 
whether the read data matches said comparison 
value. 

7. A method In accordance with clam 6 
wherein said step of sending a directive to said 
selected memory element further includes the step 
of sending a timeout value to saSd memory ele- 
ment: and 

terminating the execution of said directive when 
said timeout value is exceeded. 

8. A method in accordance with claim 6 
wherein said step of notifying said requesting pro- 
cessing element includes sending the data read in 
said reading step to said requesting processing 
element. 

9. A method in accordance witii claim 6 
wherein said including tiie step of assembling said 
directive in said processing element. 
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10. A method in accordance with claim 6 
wherein said storing step includes the step of stor- 
ing said directive associatively with the address of 
said selected memory location. 

11. A method in accordance with claim 6 
wherein said step of sending said directive includes 
sending an identification of a requesting task per- 
formed by said requesting processing element. 

12. In a data processing system of claim 1 or 2 
characterized by including 

at least one processing element interconnected 
with at least one memory element by an intercon- 
nection means, a method of executing a primitive 
instruction in accordance with a directive stored in 
said memory element, said method comprising the 
steps of: 

reading said directive including 
1} an address of a selected memory location, and 
2} an identification of a requesting processing ele- 
ment; 

reading tiie data at said address; 
sending the read data to said processing element: 
repeating said steps of reading said directive, read- 
ing the data, and sending the data when the data in 
said selected memory location is altered. 

13. A method in accordance with claim 12 
wherein said directive further includes an identifica- 
tion of a requesting task and said sending step 
furtiier includes sending said task identification to 
said requesting processing element. 
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